Dynamic Biobanking for Advancing Breast Cancer Research

Longitudinal patient biospecimens and data advance breast cancer research through enabling precision medicine approaches for identifying risk, early diagnosis, improved disease management and targeted therapy. Cancer biobanks must evolve to provide not only access to high-quality annotated biospecimens and rich associated data, but also the tools required to harness these data. We present the Breast Cancer Now Tissue Bank centre at the Barts Cancer Institute as an exemplar of a dynamic biobanking ecosystem that hosts and links longitudinal biospecimens and multimodal data including electronic health records, genomic and imaging data, offered alongside integrated data sharing and analytics tools. We demonstrate how such an ecosystem can inform precision medicine efforts in breast cancer research.


Introduction
Approximately one in eight women are diagnosed with breast cancer in their lifetime [1]. Despite recent advances in early detection and treatment of BC, and extensive histopathological and molecular characterisation of primary tumours, around 30% of patients go on to experience recurrence [2] or metastasis which is considered incurable. These figures are expected to rise in the upcoming years due to the disruptions in screening programmes and treatments related to the COVID-19 pandemic, with nearly 1 million women in the UK having missed their mammogram appointments [3]. While effects of the COVID-19 pandemic amplified existing challenges and health inequalities, increasing demand for innovative solutions, the risk factors for breast cancer remain not well-understood. This is particularly evident in multi-ethnic communities who are underrepresented in or absent from large studies. Cancer biobanks are important biomedical research infrastructures for preclinical, translational and clinical studies that can help address these challenges.
As we move forward in the era of precision medicine, cancer biobanks are required to adopt new strategies for patient recruitment and collection, processing, storage and distribution of biospecimens and data to better contribute to this research area and the wider healthcare landscape. Linkage to other data sources such as electronic health records (EHRs) is vital to complement a biobank's clinical data. Healthcare datasets, as managed in EHR systems, are typically stored in highly variable formats. There is homogeneity in terms of structure and adoption of standard biomedical ontologies (e.g., ICD [4], OPCS [5] and SNOMED-CT [6]); however, a notable portion of EHRs (e.g., discharge summaries) is in a free-text format. While these unstructured data are often used to report the results of examinations, tests or procedures, and serve as a means of communication between healthcare providers, they contain valuable information that is not possible to obtain from their structured counterparts. Such information can be extracted using artificial intelligence 2 of 12 algorithms and stored in a structured format for subsequent processing. Similarly, when combined with a biobank's clinical data, sequencing data allow researchers to discover new variants that contribute to disease onset and progression, identify drug targets for accelerating drug discovery and development, and stratify patients into groups such as those more likely to respond to treatment or experience side-effects. Moreover, provision of tools for analysing and sharing these data is crucial to maximise the use of the biospecimen and data hosted by cancer biobanks.
Breast Cancer Now Tissue Bank (BCNTB) [7] is the UK's first national breast cancer biobank, providing access to the largest collection of longitudinal biospecimens and data from breast cancer patients and individuals without the disease at various stages of the care pathway. The Bank has eight (four legacy and four active) centres in England and Scotland, with Barts Cancer Institute (BCI) acting as the coordinating centre. In this paper, we explore the biospecimens, data and tools made available by the BCNTB centre at BCI (BCNTB-BCI). We present an approach to building and maintaining a dynamic biobanking ecosystem, that features: a centralised database with built-in quality assurance mechanisms for storing multimodal data; linkage to data from external sources, including EHRs from primary and secondary care plus genomic data from national and international sequencing initiatives, for enriching the Bank's data; and an integrated biospecimen and data request system and a bioinformatics portal for sharing and analysing these data. While individual elements of such an infrastructure have been suggested in the past (e.g., see [8][9][10]), no cancer biobank has implemented all these elements, specifically for breast cancer research. Using BCNTB-BCI as an exemplar, we demonstrate how dynamic cancer biobanks can fill the gaps in breast cancer patient data, support researchers to advance breast cancer research and assist clinicians in early diagnosis and effective treatment of breast cancer for optimal outcome.

Materials and Methods
Donor Selection and Recruitment. Necessary ethics and governance approvals (NHS Research Ethics Committee reference number 21/EE/0072) are in place for the routine collection of longitudinal biospecimens and data from the donors attending breast cancer clinics across Barts Health NHS Trust (BH). They include patients undergoing biopsy or surgery for benign, atypical or malignant breast lesions as well as individuals undergoing surgery for cosmetic or risk reduction reasons. We check the National Data Opt-out list to exclude donors who have opted out from the use of their data for research. The Bank's consent form and patient information sheet are discussed in detail with eligible donors, giving them the opportunity to ask questions. Donors are informed that they can withdraw their consent at any time. Following consent, each donor is assigned a unique biobank ID which links their biospecimens to the associated data.
Biospecimens and Multimodal Data. BCNTB-BCI hosts a repository of biospecimens and multimodal associated data. These biospecimens include liquid samples (e.g., whole blood, plasma, serum and buffy coat), fresh and frozen tissue samples (e.g., tissue microarrays (TMAs), formalin-fixed paraffin-embedded (FFPE) blocks and haematoxylin and eosin (H&E) stains) and primary cells isolated from breast tissue (e.g., epithelial and myoepithelial cells and fibroblasts). Tissue samples (tumour, cancer-surrounding, benign or healthy) are generally surplus to diagnostic requirements, while liquid samples and derivatives are taken specifically for the Bank. All biospecimens are collected and prepared according to BCNTB standard operating procedures (SOPs). The associated data are manually curated by the Bank through a variety of sources (e.g., patient questionnaires and the East London Patient Record (eLPR) system [11]). These data include donor demographics and survival status, personal and family medical history, lifestyle, presentation and diagnosis, procedures (biopsy, surgery and imaging), treatments (chemotherapy and radiotherapy), follow-up reports and images (scanned H&E slides) manually extracted from clinical letters, pathology and imaging reports and vital sign readings.
Data from BCNTB-BCI and all other BCNTB centres are stored in a centralised Research Electronic Data Capture (REDCap) [12,13] database, a secure web-based data capture application for building and managing online surveys and databases to support research studies, at the Bank's coordinating centre. Since BCNTB is a multisite national biobank, we defined a core data dictionary that allows shared data processing and reporting across sites. Data from all the legacy BCNTB centres were restructured to adhere to this data dictionary. Several REDCap features are utilised to enable and improve multisite access, data entry and validation (via branching logic, calculated fields and custom scripts run through the REDCap API), audits (e.g., via custom reports) and regulatory compliance. BCNTB-BCI data are manually audited periodically, with each data point being verified by at least two members of the Bank. Furthermore, BCNTB-BCI data are mapped to concepts from the SNOMED-CT biomedical ontology to facilitate the exchange and comparison of these data with information from other sources (e.g., EHRs and genomic data). Any data distributed for research are pseudonymised.
Linkage to EHRs and Genomic Data. To supplement and validate the Bank's manually curated internal data, we considered linkage to external data from primary and secondary care as well as national and international sequencing initiatives. BCNTB-BCI data are linked to EHRs from BH via the NHS or hospital number. BH is the largest NHS trust in London with five hospitals (St Bartholomew's, The Royal London, Mile End, Newham and Whipps Cross) serving 2.5 million people across East London [14]. These EHRs include patient demographics and survival status, medical history, physical and social assessments, diagnosis, procedures, imaging reports, pathology results and histopathology reports, electronic prescriptions, maternity services data, clinician notes and cancer registries data (e.g., the Somerset Cancer Register [15]) that are consolidated from several datasets hosted or linked by BH [16]. To transform EHRs from BH into clinically relevant features for construction of cohorts suitable for breast cancer research, we performed a series of data cleaning and wrangling processes. During these processes, EHRs were assessed with regards to their syntax (e.g., format), schema (e.g., data model and structure) and semantics (e.g., clinical interpretation); duplicates, missing values and discrepancies were flagged to be manually reviewed and compared to information from other sources. In addition, we used natural language processing (NLP) techniques to extract a variety of information from the unstructured part of these EHRs, and stored it in a structured format (e.g., mapped to the SNOMED-CT concepts). This information includes mentions of cancer-related and non-cancer-related symptoms, conditions, drugs and outcomes. For instance, we applied NLP to over 45,000 free-text reports of various imaging procedures (e.g., mammogram, computed tomography (CT), magnetic resonance imaging (MRI) and ultrasound (US)) to extract information that can be used to predict breast cancer recurrence (Abdollahyan et al., manuscript in preparation).
To date, tumour and normal samples from many BCNTB donors have been sequenced and the results made available to researchers through repositories such as the European Genome-phenome Archive (EGA) [17]. Moreover, BCNTB-BCI data are linked to sequencing data from the 100,000 Genomes Project (100K GP) [18] via a unique ID for over 320 BCNTB donors who are dually consented by BCNTB-BCI and 100K GP. These data, where available, include primary clinical data for the 100K GP participants and secondary clinical data received from NHS Digital and the National Cancer Registration and Analysis Service [19].
Biospecimen and Data Request System. The Bank's expression of interest (EOI) and application forms are implemented in REDCap. The forms are accompanied by Sample Finder, a tool implemented using an open-source Python web development framework [20] that allows researchers to search and filter biospecimens by donor and tumour characteristics. Data shown in Sample Finder are exported from the centralised REDCap database where BCNTB data are stored, reflecting the most up-to-date clinical and sample data available at the time. Using Sample Finder, researchers can instantly see how many cases match their defined cohort and view a summary report in the form of a graphical breakdown of the matching cases' donor and tumour statistics. In case of successful biospecimen and data request applications, researchers are invited to return the data generated as part of their study to the Bank after the study results enter the public domain. The Biospecimen and Data Request System is available at https://breastcancernow.org/breast-cancer-research/ breast-cancer-now-tissue-bank/how-apply-tissue-bank accessed on 15 February 2023.
Bioinformatics Portal. The Bank hosts its own bioinformatics portal, implemented using open-source Python and R packages [21], that includes Analytics Hub, a data analytics platform. The portal acts as a central hub, giving researchers access to a list of associated publications, user guides and the Bank's contact information.
Analytics Hub, powered by a custom version of SNPnexus [22], provides an environment on which the output of Whole Genome Sequencing pipelines (e.g., FASTQ and VCF files), clinical and omics data returned by studies that used BCNTB-BCI biospecimens (e.g., the Spatial Characterisation [23] and the 100K GP cohort datasets) and publicly available data from large-scale studies (e.g., the Cancer Genome Atlas (TCGA) [24], the Cancer Cell Line Encyclopedia (CCLE) [25] and the International Cancer Genome Consortium (ICGC) [26] datasets) can be viewed, filtered, analysed and incorporated into research projects and grant applications. Analytics Hub has bidirectional links to Sample Finder, i.e., upon entering a query in Analytics Hub, the platform also checks the availability of biospecimens and data from BCNTB-BCI donors with the same characteristics on Sample Finder, for which researchers can submit an EOI to the bank; inversely, upon entering a query in Sample Finder, the tool also checks the availability of data for patients from other studies with the same characteristics of interest, which researchers can explore on Analytics Hub. The Bioinformatics Portal is available at https://bcntb.bcc.qmul.ac.uk/home accessed on 15 February 2023. available at the time. Using Sample Finder, researchers can instantly see how many cases match their defined cohort and view a summary report in the form of a graphical breakdown of the matching cases' donor and tumour statistics. In case of successful biospecimen and data request applications, researchers are invited to return the data generated as part of their study to the Bank after the study results enter the public domain. The Biospecimen and Data Request System is available at https://breastcancernow.org/breast-cancer-research/breast-cancer-now-tissue-bank/how-apply-tissue-bank. Bioinformatics Portal. The Bank hosts its own bioinformatics portal, implemented using open-source Python and R packages [21], that includes Analytics Hub, a data analytics platform. The portal acts as a central hub, giving researchers access to a list of associated publications, user guides and the Bank's contact information.

BCNTB-BCI Infrastructure
Analytics Hub, powered by a custom version of SNPnexus [22], provides an environment on which the output of Whole Genome Sequencing pipelines (e.g., FASTQ and VCF files), clinical and omics data returned by studies that used BCNTB-BCI biospecimens (e.g., the Spatial Characterisation [23] and the 100K GP cohort datasets) and publicly available data from large-scale studies (e.g., the Cancer Genome Atlas (TCGA) [24], the Cancer Cell Line Encyclopedia (CCLE) [25] and the International Cancer Genome Consortium (ICGC) [26] datasets) can be viewed, filtered, analysed and incorporated into research projects and grant applications. Analytics Hub has bidirectional links to Sample Finder, i.e., upon entering a query in Analytics Hub, the platform also checks the availability of biospecimens and data from BCNTB-BCI donors with the same characteristics on Sample Finder, for which researchers can submit an EOI to the bank; inversely, upon entering a query in Sample Finder, the tool also checks the availability of data for patients from other studies with the same characteristics of interest, which researchers can explore on Analytics Hub. The Bioinformatics Portal is available at https://bcntb.bcc.qmul.ac.uk/home. Figure 1 shows the overview of the described BCNTB-BCI infrastructure.

BCNTB-BCI Cohort Overview
To date, BCNTB-BCI has collected over 42,000 biospecimens and data from 2400 breast cancer patients and individuals without the disease (6% of donors approached for consent have dissented). Longitudinal liquid samples and derivatives, tissue samples and cell cultures are available for 17%, 99% and 21% of BCNTB-BCI donors, respectively, with

BCNTB-BCI Cohort Overview
To date, BCNTB-BCI has collected over 42,000 biospecimens and data from 2400 breast cancer patients and individuals without the disease (6% of donors approached for consent have dissented). Longitudinal liquid samples and derivatives, tissue samples and cell cultures are available for 17%, 99% and 21% of BCNTB-BCI donors, respectively, with samples collected at the time of diagnosis and through the treatment cycles until the last follow-up or death.
Linkage of BCNTB-BCI data to EHRs allows a quick inspection of the breast cancer landscape in our consented cohort. We selected a subset of BCNTB-BCI donors (data extracted on 22 September 2022) who have been followed up for at least 5 years or until death (whichever is earlier) and for whom age at diagnosis, ethnicity, disease and survival status data are available and independently verified throughout this period (n = 1066). Figures 2-4 show the evolution of disease status in this subset of donors over 5 years of follow-up (the period during which risk of breast cancer recurrence is high), the prevalence and co-occurrence of conditions that are often comorbid with breast cancer in these patients, and their age at diagnosis of primary breast cancer and ethnicity, respectively. Our results suggest that breast cancer screening uptake in the consented population of East London covered by BH is low, resulting in many cases presenting at later stages of the disease. For instance, the de novo metastatic breast cancer rate is at least 2% and the incidence of recurrence is the highest in the first year after diagnosis ( Figure 2). Oestrogen receptor (ER) negative breast cancer cases make up the majority of recurrence cases diagnosed in this period. Additionally, the results show that the most common comorbid condition is cardiovascular disease with a prevalence rate of 38%, followed by hypertension and rheumatic disease (Figure 3). This is in agreement with statistics reported in numerous other studies that investigated the prevalence of comorbidities in breast cancer patients (e.g., see [27]). However, our results also highlight the challenges in breast cancer detection and treatment in the multi-ethnic population of East London, where over 20% and over 30% of patients are Black and Asian [14], respectively. Overall survival of patients, comprised of relapse-free (the 'Primary' category) and secondary breast cancer (the 'Recurrence' category) cases, is worse in these ethnic groups. For example, the 5-year overall survival rate for our consented Black group is 78%, which is lower than the rate for all BCNTB-BCI donors (84%, Figure 2) and the national average of 85% [1]. Similarly, the age at diagnosis is younger particularly in the Asian and Black groups (Figure 4). In addition, the results highlight some of the challenges in breast cancer management in this region. For example, the comorbidity burden is high, with 14% of patients experiencing three or more comorbidities (Figure 3). landscape in our consented cohort. We selected a subset of BCNTB-BCI donors (data extracted on 22 September 2022) who have been followed up for at least 5 years or until death (whichever is earlier) and for whom age at diagnosis, ethnicity, disease and survival status data are available and independently verified throughout this period (n = 1066). Figures 2-4 show the evolution of disease status in this subset of donors over 5 years of follow-up (the period during which risk of breast cancer recurrence is high), the prevalence and co-occurrence of conditions that are often comorbid with breast cancer in these patients, and their age at diagnosis of primary breast cancer and ethnicity, respectively. Our results suggest that breast cancer screening uptake in the consented population of East London covered by BH is low, resulting in many cases presenting at later stages of the disease. For instance, the de novo metastatic breast cancer rate is at least 2% and the incidence of recurrence is the highest in the first year after diagnosis ( Figure 2). Oestrogen receptor (ER) negative breast cancer cases make up the majority of recurrence cases diagnosed in this period. Additionally, the results show that the most common comorbid condition is cardiovascular disease with a prevalence rate of 38%, followed by hypertension and rheumatic disease (Figure 3). This is in agreement with statistics reported in numerous other studies that investigated the prevalence of comorbidities in breast cancer patients (e.g., see [27]). However, our results also highlight the challenges in breast cancer detection and treatment in the multi-ethnic population of East London, where over 20% and over 30% of patients are Black and Asian [14], respectively. Overall survival of patients, comprised of relapse-free (the 'Primary' category) and secondary breast cancer (the 'Recurrence' category) cases, is worse in these ethnic groups. For example, the 5-year overall survival rate for our consented Black group is 78%, which is lower than the rate for all BCNTB-BCI donors (84%, Figure 2) and the national average of 85% [1]. Similarly, the age at diagnosis is younger particularly in the Asian and Black groups (Figure 4). In addition, the results highlight some of the challenges in breast cancer management in this region. For example, the comorbidity burden is high, with 14% of patients experiencing three or more comorbidities (Figure 3).

Clinical and Molecular Patient Journey Narratives
The infrastructure of our cancer biobank facilitates the generation of clinical and m lecular patient journey narratives, which are powerful tools for precision medicine. Fig  5 shows the clinical patient journey narrative for a breast cancer patient (referred to

Clinical and Molecular Patient Journey Narratives
The infrastructure of our cancer biobank facilitates the generation of clinical and molecular patient journey narratives, which are powerful tools for precision medicine. Figure 5 shows the clinical patient journey narrative for a breast cancer patient (referred to as Patient A) over 5 years (denoted by I to V in the figures) based on their biobank data from BCNTB-BCI. These data are mostly structured, which eases visualisation and interpretation of patient journey narratives.  Figures 6 and 7 show the clinical patient journey narratives for Patient A based on the structured and unstructured parts (specifically imaging reports) of their EHRs from BH, respectively. EHRs differ from biobank data in terms of what information is recorded, and when and how this information is recorded: compared to BCNTB-BCI data, EHRs from BH provide a better coverage of clinical events, which makes them suitable for certain studies (e.g., comorbidity analysis); however, they are not necessarily cancer-specific, and therefore, require filtering before they can be used to visualise cancer patient trajectories. In BCNTB-BCI data, clinical events appear sooner and with a single date, whereas in EHRs from BH, codes may have been entered multiple times with different dates. As a cancer-oriented biobank, BCNTB records cancer-related information explicitly and mostly in a structured format, while in EHRs, the same information may not be directly available or are recorded in an unstructured format. For instance, BCNTB-BCI records age at diagnosis, family history of cancer and treatment regimen, including the response and any side effects, explicitly. In contrast, using EHRs from BH, we calculated age at diagnosis based on date of birth and date of the earliest instance of the relevant diagnosis code. Likewise, we filtered family history data by relation (first-degree relative or otherwise) and disease to obtain cancer-specific family history information. Similarly, drugs that are part of the same regimen are not grouped together, and the response to treatment and any side effects are usually mentioned in free-text reports; we extracted and aggregated this information.  Figures 6 and 7 show the clinical patient journey narratives for Patient A based on the structured and unstructured parts (specifically imaging reports) of their EHRs from BH, respectively. EHRs differ from biobank data in terms of what information is recorded, and when and how this information is recorded: compared to BCNTB-BCI data, EHRs from BH provide a better coverage of clinical events, which makes them suitable for certain studies (e.g., comorbidity analysis); however, they are not necessarily cancer-specific, and therefore, require filtering before they can be used to visualise cancer patient trajectories. In BCNTB-BCI data, clinical events appear sooner and with a single date, whereas in EHRs from BH, codes may have been entered multiple times with different dates. As a canceroriented biobank, BCNTB records cancer-related information explicitly and mostly in a structured format, while in EHRs, the same information may not be directly available or are recorded in an unstructured format. For instance, BCNTB-BCI records age at diagnosis, family history of cancer and treatment regimen, including the response and any side effects, explicitly. In contrast, using EHRs from BH, we calculated age at diagnosis based on date of birth and date of the earliest instance of the relevant diagnosis code. Likewise, we filtered family history data by relation (first-degree relative or otherwise) and disease to obtain cancer-specific family history information. Similarly, drugs that are part of the same regimen are not grouped together, and the response to treatment and any side effects are usually mentioned in free-text reports; we extracted and aggregated this information.
side effects, explicitly. In contrast, using EHRs from BH, we calculated age at diagnosis based on date of birth and date of the earliest instance of the relevant diagnosis code. Likewise, we filtered family history data by relation (first-degree relative or otherwise) and disease to obtain cancer-specific family history information. Similarly, drugs that are part of the same regimen are not grouped together, and the response to treatment and any side effects are usually mentioned in free-text reports; we extracted and aggregated this information.  Using 100K GP data, we identified somatic variants for Patient A using the Analytics Hub. These variants were used as input query set to the Cancer Genome Interpreter (CGI, release 28) [29], which produced a list of potential therapeutically actionable candidates. In the case of this patient, TP53 D281N variant was listed as a biomarker of drug response. Figure 8 shows the drug sensitivity profiles for this variant. According to EHRs from BH, the patient was administered a combination chemotherapy regimen of carboplatin, gemcitabin and paclitaxel, and according to BCNTB-BCI data, the patient had an initial partial response followed by disease progression and died shortly after. The predictive power of TP53 mutations has not yet been sufficiently validated to permit conclusions on the pharmacogenomics to be drawn. Genomic insights, in conjunction with information from other sources (e.g., biobank data and EHRs), not only provide researchers with richer datasets, but also guide clinicians in their diagnosis and management decision-making process. Using 100K GP data, we identified somatic variants for Patient A using the Analytics Hub. These variants were used as input query set to the Cancer Genome Interpreter (CGI, release 28) [29], which produced a list of potential therapeutically actionable candidates. In the case of this patient, TP53 D281N variant was listed as a biomarker of drug response. Figure 8 shows the drug sensitivity profiles for this variant. According to EHRs from BH, the patient was administered a combination chemotherapy regimen of carboplatin, gemcitabin and paclitaxel, and according to BCNTB-BCI data, the patient had an initial partial response followed by disease progression and died shortly after. The predictive power of TP53 mutations has not yet been sufficiently validated to permit conclusions on the pharmacogenomics to be drawn. Genomic insights, in conjunction with information from other sources (e.g., biobank data and EHRs), not only provide researchers with richer datasets, but also guide clinicians in their diagnosis and management decision-making process.

Discussion
We adopted a dynamic infrastructure using REDCap for the management of BCNTB data and provision of its biospecimens. We implemented and customised measures that improve the quality (correctness, completeness, consistency and currency or timeliness) of the Bank's data. Additionally, mapping our data to a standard biomedical ontology (the SNOMED-CT concepts) facilitated their integration and interoperability with EHRs and genomic data. This resulted in a rich longitudinal multimodal dataset that captures the breast cancer landscape in a multi-ethnic population and highlights the shortcomings in

Discussion
We adopted a dynamic infrastructure using REDCap for the management of BCNTB data and provision of its biospecimens. We implemented and customised measures that improve the quality (correctness, completeness, consistency and currency or timeliness) of the Bank's data. Additionally, mapping our data to a standard biomedical ontology (the SNOMED-CT concepts) facilitated their integration and interoperability with EHRs and genomic data. This resulted in a rich longitudinal multimodal dataset that captures the breast cancer landscape in a multi-ethnic population and highlights the shortcomings in detection, management and treatment of breast cancer in the local area. Previous studies examining breast cancer presentation in this geographical region carried out surveys (via patient questionnaires and interviews) to collect data, which rely on large samples that are representative of the target population and high response rates [30]. BCNTB-BCI data can supplement, validate and guide such studies.
The BCNTB integrated Biospecimen and Data Request System offers real-time information on the availability of biospecimens and data to researchers, while simplifying the data sharing process to maximise data reuse. The BCNTB integrated Bioinformatics Portal enables the research community not only to access high-quality annotated biospecimens, but also to interrogate the associated data. Analysing various types of data available for each patient is key to implementing personalised diagnostic, prognostic and therapeutic strategies. It reduces research time and complexity, while the seamless data exchange and navigation between Sample Finder and Analytics Hub helps overcome the barriers due to lack of data compatibility. This created a domain-specific repository of breast cancer data and tools that can be used to generate clinical and molecular patient journey narratives. Such narratives offer context and a cross-functional view of patient experience that cannot be achieved without employing these tools and by interrogating data from either source alone, or engage all stakeholders to identify gaps in breast cancer care.
Our observations are limited to a consented subset of donors followed up for over 5 years, and therefore, need to be validated in a larger unbiased cohort. Another limitation is the impact of donors who are lost to follow-up. Linkage of BCNTB-BCI data to primary care EHRs from the Discovery East London programme [31] is currently in progress. This will enable us to generate near-complete data for more comprehensive clinical patient journey narratives by enriching the demographics, lifestyle, comorbidity and mortality data that have so far been recorded in hospital visits.

Conclusions
It is important to note some of the challenges and lessons learned from our work. The underlying infrastructure of a biobank must offer scalability and interoperability, allowing data growth and flow. Biobanking staff at multiple sites need a user-friendly system that supports entry, querying, management and auditing of clinical and sample data. Researchers need a system that enables them to search and request biospecimens and data. They also need the tools by which data generated from the bank's biospecimens as well as related initiatives can be analysed. Importantly, the infrastructure of a biobank should evolve in response to feedback from clinicians and researchers.
A challenge faced by multisite biobanks is the variation in terminology used both within and between collaborating centres. This is an issue, especially for retrospective collections, that must be resolved through the implementation of a controlled vocabulary for all the biobank's operating centres using international standard biomedical ontologies. Another challenge is managing the collections from legacy sites and facilitating the onboarding of new sites. Furthermore, extracting information from EHRs requires domain experts to be involved from the start of any biobanking project. At BCI, we work closely with our NHS Trust to develop knowledge and best practices for transforming health records into breast cancer research-ready data. This, however, poses a challenge when patients are lost to follow-up at BH.
Data generated from a biobank's biospecimens are its legacy. Sharing these data helps avoid duplication of effort and financial resources for researchers and depletion of rare biospecimens, and allows researchers to conduct preliminary analysis and validation of ongoing projects. Data sharing is vital for translation of research to patient benefit. Biobanks can be a powerful catalyst in tackling resistance to data sharing.
We present BCNTB-BCI as a dynamic cancer biobank with an infrastructure that includes several features to help tackle these challenges in breast cancer research. They include: Implementation of a flexible yet robust data management system; linkage to other health data sources; and integrated data access and analytics tools. We demonstrated how these features improve data quality, increase interoperability, encourage data reuse and accelerate research. Specifically, we showcased an example where such a biobanking ecosystem can aid in predicting the health trajectory of breast cancer patients and identifying potential actionable opportunities for intervention.

Institutional Review Board Statement:
The study was conducted in accordance with the Declaration of Helsinki, and approved by the NHS Research Ethics Committee (reference number 21/EE/0072).

Informed Consent Statement:
Informed consent was obtained from all subjects involved in the study.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author. Patient data are not publicly available due to ethical restrictions.